1  Introduction


Time series data is abundant in many domains, including finance, weather forecasting, and
epidemiology. Large-scale bio-surveillance programs monitor the status of public health
for adverse events such as outbreaks of infectious diseases and other emerging patterns.
They rely on data collected throughout a health management system (hospital records,
health insurance company records, lab test requests and results, issued and filled
prescriptions, ambulance and emergency phone service calls, etc.) as well as outside of it
(school and workplace absenteeism, sales of non-prescription medicines, etc.). The key
objective is to detect, as early and as reliably as possible, changes in the statistics of
these data sources that may indicate a developing public health problem. One of the
challenges the users of such systems face is data overload: the number of, for example,
daily drug sales transactions in pharmacies across a sizable country may be very large.
The users need tools that enable timely analysis of these massive data sources. The
analyses can be performed automatically (using data mining software); however, patterns
discovered that way are almost always subject to careful scrutiny through a manual
drill-down. In both scenarios, massive screening of very large collections of data must be
executed very quickly to make these bio-surveillance systems useful in practice. A saving
of just a few hours in the detection time of an outbreak of a lethal infectious disease
can yield enormous monetary and social benefits [3], [10].

Most of the kinds of data mentioned above can be interpreted as time series of interval
(e.g. daily) counts of events, such as the number of drugs of a certain type (e.g.
anti-diarrheals) sold, or the number of patients reporting to an emergency department with
specific symptoms. These time series can be sliced and diced across multiple symbolic
dimensions such as location, gender, and age group of patients. The computational
efficiency of mining operations that one may want to apply to such data, as well as the
efficiency of accessing interesting information in a manual drill-down mode, depends
heavily on how efficiently series of counts aggregated for specific values of these
categorical dimensions can be extracted.

This report introduces a new data structure designed to dramatically decrease the time
needed to retrieve such aggregates for any complex query that combines values of
categorical variables. It achieves its efficiency by pre-computing and caching responses
to all possible queries against the underlying temporal database of counts annotated with
sets of symbolic labels, while keeping memory requirements in check.

1.1  Importance  of  Ad-hoc  Queries  against  Temporal  Databases 

A record of a typical transactional database contains multiple attribute-value pairs. In
temporal databases, one of the key attributes is the date, which allows ordering entries
by time and creating time series representations of the contents of the database. Other
fields may be categorical (symbolic), real-valued, or of a special type. We focus on
temporal databases with symbolic attributes used to characterize the demographics of the
individual entries (the term “demographics” is used here in its most general sense). The
databases of specific interest to us can be thought of as records of transactions: each
entry also carries a count component. Let us assume that all attributes of the
transactional dataset are symbolic except for the date field. Distinct values of such
demographic attributes are called demographic values or properties. For example, postal
code (called “zip code” in the US) is a demographic attribute and “15213” is one of the
values it can assume. We also define a set of demographic properties (DPS) to be a set of
assignments of demographic values, one per demographic attribute in the data. For
example, in a dataset containing zip code and gender, a particular combination of values
such as {15213, Male} is a DPS. Most databases we encounter in practice contain between
ten and a hundred thousand unique DPSes.

Time series queries often extract data corresponding to conjunctions of demographic
attribute-value pairs. A simple query is restricted to one value per demographic
attribute, whereas in a complex query each attribute can take multiple values. Below we
show examples of a simple and a complex query against a dataset “ed_table” which contains
demographic attributes such as zip code, symptom and age group. Note that the second
example query, covering the Pittsburgh region, can contain a couple hundred zip codes.
Each simple or complex query may involve only a subset of all available demographic
attributes. (Equivalently, the attributes not represented in such a query may be filled in
with the complete list of their values, which amounts to putting a “do not care” symbol
next to that attribute name in the select statement.)

•  All senior patients having respiratory complaints (example of a simple query):

SELECT date, COUNT(*) FROM ed_table
WHERE age_group IN ('senior') AND syndrome IN ('respiratory')
GROUP BY date;

•  All child patients in the Pittsburgh region with fever or headache (a complex query):

SELECT date, COUNT(*) FROM ed_table
WHERE zip_code IN (15213, 15232, ..., 15237, ..., 15215)
AND symptom IN ('fever', 'headache') AND age_group IN ('child')
GROUP BY date;
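
The two query shapes above can be tried out on a toy table. The following is only a
sketch: the schema, the five rows, and the helper `time_series` are hypothetical and chosen
for illustration, and the column names are pasted into the SQL text, so they must come
from trusted code rather than from user input.

```python
import sqlite3

# Hypothetical miniature of ed_table; rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ed_table (date TEXT, zip_code TEXT, symptom TEXT, age_group TEXT)"
)
rows = [
    ("2006-01-01", "15213", "fever", "child"),
    ("2006-01-01", "15232", "headache", "child"),
    ("2006-01-01", "15213", "respiratory", "senior"),
    ("2006-01-02", "15213", "fever", "child"),
    ("2006-01-02", "15237", "fever", "adult"),
]
conn.executemany("INSERT INTO ed_table VALUES (?, ?, ?, ?)", rows)


def time_series(conn, constraints):
    """Return (date, count) pairs for rows matching every constraint.

    constraints maps a column name to the list of allowed values:
    one value per column gives a simple query, several give a complex one.
    """
    where, params = [], []
    for column, values in constraints.items():
        placeholders = ",".join("?" * len(values))
        where.append(f"{column} IN ({placeholders})")
        params.extend(values)
    sql = "SELECT date, COUNT(*) FROM ed_table"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " GROUP BY date ORDER BY date"
    return conn.execute(sql, params).fetchall()


simple = time_series(conn, {"age_group": ["senior"], "symptom": ["respiratory"]})
complex_ = time_series(
    conn,
    {
        "zip_code": ["15213", "15232"],
        "symptom": ["fever", "headache"],
        "age_group": ["child"],
    },
)
print(simple)    # [('2006-01-01', 1)]
print(complex_)  # [('2006-01-01', 2), ('2006-01-02', 1)]
```

Every call aggregates at run time, which is exactly the cost that pre-computed answers are
meant to avoid.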

Even for databases with all demographic attributes indexed separately, simple queries
require aggregation at run time, which can become expensive. The number of simple queries
alone is exponential in the number of demographic attributes and grows with their arities.
The number of possible complex queries is much higher still, because each attribute may
take any subset of its values rather than a single one. It is thus not practical to
pre-compute and store the answers to all possible queries in a straightforward fashion.
Database management (DBM) systems often provide tools to pre-compute materialized views
and use stored procedures, which help to quickly answer a (usually small and predefined)
subset of all possible complex queries. Those DBM techniques fail in ad-hoc scenarios,
where the user queries are not known in advance.
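
To make this growth concrete, the sketch below counts queries for given attribute arities.
It rests on our reading of the definitions above rather than on a formula from the report:
in a simple query each attribute takes one value or a “do not care”, and in a complex
query each attribute takes any non-empty subset of its values. The function names and the
example arities are hypothetical.

```python
from math import prod


def count_simple_queries(arities):
    # Each attribute contributes one of its values, or "do not care".
    return prod(a + 1 for a in arities)


def count_complex_queries(arities):
    # Each attribute contributes any non-empty subset of its values
    # (the full set plays the role of "do not care").
    return prod(2 ** a - 1 for a in arities)


arities = [3, 3]  # hypothetical: two attributes with three values each
print(count_simple_queries(arities))   # 16
print(count_complex_queries(arities))  # 49
```

Under these assumptions a single attribute with a few thousand values already makes the
number of complex queries far too large to enumerate.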

We are aware of at least two good examples of practical settings in which ad-hoc querying
capability is needed. First, consider a database used by public health officials for
bio-surveillance. These officials monitor disease outbreaks on a daily basis, which often
involves investigating alerts of possible problems flagged by automated outbreak detection
systems. For such investigations, the health officials need to execute large numbers of
complex ad-hoc queries to extract the data needed for interpreting the alerts and for
isolating their likely causes, while differentiating real outbreaks from false alarms.
Such investigations need to be executed in a timely fashion, and long waits for data
extracts are not acceptable. Second, automated statistical analyses or data mining
procedures executed against such databases require access to a large number of different
projections of the data if they are to screen comprehensively and purposefully for
possible indications of public health problems. Long waits for the corresponding extracts
may (and often do) render such systems impractical, as both example use cases rely
heavily on quick responses to complex queries. The users would benefit dramatically if
responses to all possible queries could be cached in advance.

Data cubes, the standard mechanism for ad-hoc querying reviewed in Section 2, still
typically require on the order of one second or more to respond to each complex query.
This latency is a substantial inconvenience to users who want to execute multiple queries
in an online fashion, and it makes data cubes far too slow for statistical analyses that
require millions of complex queries, which would take days of processing time.

2  Related  Work 

The standard approach to handling ad-hoc queries in commercial databases is On-Line
Analytical Processing (OLAP). The idea relies on data cubes: cached data structures
extracted from (usually only parts of) the original data and constructed in a form that
allows fast ad-hoc querying of pre-selected subsets of aggregated data [1]. For the sake
of brevity we do not review the details of OLAP technology here, but these methods are
known to suffer from long build times (typically hours for databases of sizes and
complexities similar to those used to illustrate results in this report) and large memory
requirements (which force reliance on high-end database servers). Additionally, as we
have observed empirically, data cubes still typically require on the order of one second
or longer to respond to a complex query on the datasets we tested. Such latency is a
substantial inconvenience to users who want to execute multiple queries in an online
fashion. It also hampers statistical analyses that may require millions of complex
queries, which could take days of processing time using industry-standard OLAP data
cubes. More details and illustrative examples can be found in [3].

Data cube implementations differ in how counts are stored internally. Relational OLAP
(ROLAP) stores all counts in the data cube in the form of a relational table.
Multi-dimensional OLAP (MOLAP) stores the counts directly as a multi-dimensional array;
the response to any simple query can therefore be obtained in constant time, which makes
MOLAP faster than its ROLAP counterparts, but its memory requirements grow exponentially
with the number of dimensions in the data. Other popular variants include Hybrid OLAP
(HOLAP) and Hierarchical OLAP. HOLAP combines the benefits of ROLAP (reasonable memory
requirements) and MOLAP (faster response time). It follows the intuition that data cubes
built at higher abstraction levels are denser than those built at lower levels, so HOLAP
implementations often use MOLAP at coarser levels and ROLAP at finer levels. Hierarchical
OLAP is used when the values of a demographic attribute are connected to each other
through some hierarchy. For example, a date attribute can be split into year, month, and
day, and aggregations at the year or month level can then be obtained using data cubes.
Irrespective of the implementation, the goal of data cubes is to respond to simple and
complex queries against large databases as fast as possible. With growing demand for data
mining and statistical analyses against large databases, innovation has focused on
improving data cube performance [7], [8], [9].

Data cubes are closely related to another technology which originated from computer
science research: Cached Sufficient Statistics. Like data cubes, cached statistics
structures pre-compute answers to queries, but they are designed to cover all possible
future queries (not just pre-selected subsets), and they aim at efficiency not only of
data retrieval but also of the (most often in-memory) representation. AD-Trees are very
good examples of such data structures.

The AD-Tree [2] (All-Dimensional Tree) is designed to efficiently represent counts of all
possible co-occurrences among multiple attributes of symbolic data. This is very important
in many scenarios involving statistical modeling of such data, where most operations
require computing aggregate counts, ratios of counts, or their products. Quick access to
counts of arbitrary subsets of demographic properties is essential for the overall
performance of analytic tools which rely on them. AD-Trees have been shown to dramatically
speed up notoriously expensive machine learning algorithms including Bayesian Network
learning [2], Empirical Bayesian Screening, Decision Tree learning, and Association Rule
learning [5]. The attainable speedups range from one to four orders of magnitude with
respect to previously known efficient implementations. These efficiencies are attainable
at moderate memory requirements, which are easy to control. Moore and Lee describe in [2]
the details of the structure, its construction algorithm, and the fundamental
characteristics of AD-Trees. There are also dynamic implementations of AD-Trees [6] which
grow the structure on demand and which can be more memory efficient than the basic
implementations. AD-Trees are the best of the existing solutions to symbolic data
representation when it comes to responding very quickly to ad-hoc queries against large
datasets.

3  T-Cube 

T-Cube (Time series Cube) is an in-memory data structure designed for very fast retrieval
and analysis of additive data streams such as time series of counts. It is derived from
the idea of AD-Trees: we show how AD-Trees can be easily and efficiently extended to the
domain of time series analysis. T-Cube consists of two main components, the D-Cache and
the AD-Tree. (Because the AD-Tree is designed to work with symbolic data, T-Cube is
applicable to temporal datasets with symbolic demographic attributes.)

The D-Cache is represented as a two-dimensional matrix in which each row corresponds to
one DPS (a set containing one value per demographic attribute) and the columns correspond
to consecutive values of the time variable. Table 1 below shows an example structure of a
D-Cache built for a retail transaction dataset whose entries have two demographic
attributes (color and size of a t-shirt) and which covers T days of sales records. The
result of any time series query (simple or complex) against such a dataset is a time
series aggregated over the rows (DPSes) of the D-Cache that match the query. Hence, using
the D-Cache alone, the query response time is linear in the number of distinct DPSes in
the data, which can be slow if the number of rows in the D-Cache is large.

Table 1. D-Cache data structure for the example retail database.

DPS               Day 1   Day 2   Day T
(Red, Small)         10      15      20
(Red, Large)          5       1       7
(Blue, Medium)        7       5       9
(Green, Small)        5       4      10
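
As a rough illustration of how a D-Cache answers queries by a linear scan, the sketch
below stores the rows of Table 1 in a Python dictionary and sums the rows matching a
query. The names `d_cache`, `ATTRIBUTES` and `query_dcache`, and the choice of a
dictionary keyed by DPS tuples, are assumptions made for this example rather than details
of the T-Cube implementation.

```python
# D-Cache rows from Table 1: DPS (color, size) -> daily counts (Day 1, Day 2, Day T).
d_cache = {
    ("Red", "Small"): [10, 15, 20],
    ("Red", "Large"): [5, 1, 7],
    ("Blue", "Medium"): [7, 5, 9],
    ("Green", "Small"): [5, 4, 10],
}
ATTRIBUTES = ("color", "size")


def query_dcache(d_cache, constraints):
    """Sum the time series of every row whose DPS matches the query.

    constraints maps an attribute name to the set of allowed values;
    attributes that are not mentioned are treated as "do not care".
    """
    n_days = len(next(iter(d_cache.values())))
    total = [0] * n_days
    for dps, series in d_cache.items():  # linear in the number of DPSes
        row = dict(zip(ATTRIBUTES, dps))
        if all(row[attr] in allowed for attr, allowed in constraints.items()):
            total = [a + b for a, b in zip(total, series)]
    return total


# Simple query: all red t-shirts.
print(query_dcache(d_cache, {"color": {"Red"}}))  # [15, 16, 27]
# Complex query: small t-shirts that are red or green.
print(query_dcache(d_cache, {"color": {"Red", "Green"}, "size": {"Small"}}))  # [15, 19, 30]
```

Every query visits every row, which is the linear cost that the AD-Tree component is meant
to reduce.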



Figure  1.  Fully-developed  AD-Tree  for  the  example  retail  database. 


The goal of using the AD-Tree structure is to reduce the query response time from linear
to logarithmic in the number of rows in the D-Cache. We build the AD-Tree using the rows
of the D-Cache that contain the different DPSes present in the data. Usually, each data
node of the AD-Tree [2] contains a single count of occurrences corresponding to a
particular DPS. In the T-Cube representation, the data node instead holds a series of T
counts. This time series is the sum of the corresponding counts across all the rows in the
D-Cache that match the current AD-Tree data node. These matching rows are called the
leaf-list of that data node. As we go from the top to the bottom of the tree, the sizes of
leaf-lists decrease, and in fully developed trees the leaf nodes have leaf-lists of size
one, each pointing to a single row in the D-Cache. Such an AD-Tree combined with the
D-Cache therefore stores responses to all possible atomic time series queries against the
underlying database. Figure 1 depicts the fully developed AD-Tree representation of the
t-shirt database shown in Table 1. The data nodes with time series are shown as circles
with shaded bars, and the vary nodes over demographic attributes are depicted as
rectangles. Note that answering more general as well as complex time series queries can be
accomplished very quickly by navigating the AD-Tree and performing simple arithmetic
operations on the vectors of counts pointed to by the traversed data nodes. The attainable
efficiencies are analogous to those reported in the original AD-Tree paper [2].
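
The following sketch shows one simplified way to organize such a tree over the rows of
Table 1. It is not the implementation evaluated in this report: the class `DataNode`, the
functions `build`, `simple_query` and `complex_query`, and the use of attribute indices
(0 for color, 1 for size) are assumptions made for illustration, every node is expanded
over all remaining attributes, and none of the memory optimizations of [2] are applied.

```python
from dataclasses import dataclass, field
from itertools import product

# D-Cache rows from Table 1: DPS (color, size) -> daily counts.
D_CACHE = {
    ("Red", "Small"): [10, 15, 20],
    ("Red", "Large"): [5, 1, 7],
    ("Blue", "Medium"): [7, 5, 9],
    ("Green", "Small"): [5, 4, 10],
}
N_ATTRS = 2  # attribute 0 = color, attribute 1 = size


@dataclass
class DataNode:
    series: list                              # counts summed over the leaf-list
    vary: dict = field(default_factory=dict)  # attribute index -> {value: DataNode}


def build(leaf_list, first_attr=0):
    """Build a data node for the given D-Cache rows (its leaf-list)."""
    n_days = len(D_CACHE[leaf_list[0]])
    series = [sum(D_CACHE[dps][t] for dps in leaf_list) for t in range(n_days)]
    node = DataNode(series)
    for attr in range(first_attr, N_ATTRS):   # one vary node per remaining attribute
        groups = {}
        for dps in leaf_list:
            groups.setdefault(dps[attr], []).append(dps)
        node.vary[attr] = {v: build(rows, attr + 1) for v, rows in groups.items()}
    return node


def simple_query(root, constraints):
    """constraints maps attribute index -> single value; walk down the tree."""
    node = root
    for attr in sorted(constraints):
        node = node.vary[attr].get(constraints[attr])
        if node is None:                      # no row has this combination
            return [0] * len(root.series)
    return node.series


def complex_query(root, constraints):
    """constraints maps attribute index -> set of values; sum the simple queries."""
    attrs = sorted(constraints)
    total = [0] * len(root.series)
    for values in product(*(constraints[a] for a in attrs)):
        part = simple_query(root, dict(zip(attrs, values)))
        total = [x + y for x, y in zip(total, part)]
    return total


root = build(list(D_CACHE))
print(simple_query(root, {0: "Red"}))                         # [15, 16, 27]
print(complex_query(root, {0: {"Red", "Green"}, 1: {"Small"}}))  # [15, 19, 30]
```

A simple query touches at most one data node per constrained attribute, and a complex
query reduces to vector additions over the nodes it visits.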

4  Empirical  Evaluation 

4.1  Data 

We tested T-Cube on three datasets: two real-world and one synthetic. All datasets used in
the experiments consist of records collected over a period of one year, with counts
aggregated daily. The datasets vary in the number of records (volume), the number of
demographic attributes (dimensionality), and the number of distinct values of each
attribute (arity). Even though the first two datasets come from the domain of
bio-surveillance, their size and characteristics are similar to data found in other
domains: a few attributes of high arity (zip codes, business names, etc.) and many
attributes of low arity (gender, company sizes, etc.).

4.1.1  Chief  Complaint  Data  (ED) 

The emergency room chief complaint (ED) dataset contains hospital emergency room patient
visit records from four US states (PA, NJ, OH and UT). Each record consists of the
following attributes: visit Date, Syndrome, patient’s home Zip Code, Age Group, Gender and
Count. The patient’s home Zip Code has 21,179 distinct values; even though each of the
involved hospitals is located in one of the four states listed above, the patients come
from all over the country. Attribute Syndrome represents the chief complaint reported by
the patient and has 8 distinct values (such as Respiratory, Gastrointestinal,
Constitutional, etc.). Attribute Age Group takes one of three values: Adult, Child and
Unknown. Similarly, attribute Gender takes one of three values: Male, Female, and Unknown.
The dataset has approximately 3.4 million records, and it contains neither personal
information about the patients nor any means of identifying the involved individuals.
There is a total of 120,604 DPSes in the ED data.

4.1.2  Over-The-Counter  Data  (OTC) 

The over-the-counter (OTC) dataset contains the volumes of daily medication sales
collected at more than 10,000 pharmacies throughout the U.S. The data has been
geographically aggregated to the level of a zip code in order to preserve the privacy of
individual store operations. Each record contains the following attributes: purchase Date,
store Zip Code, medicine Category, sale Promotion, and Count. The data covers 8,119
distinct Zip Codes. Category represents the class of medicine (e.g. cough/cold remedies,
baby/child electrolytes, etc.) and has 23 different values. Binary information about the
occurrence of store promotions on medications is provided in the attribute Promotion as
‘Y’ or ‘N’. Attribute Count represents the quantity sold. The dataset has 356,545 DPSes.

4.1.3  Binary  Synthetic  Data  (SYN)

The real datasets encountered in our research have only a few demographic attributes (ED
has four and OTC has three). In order to evaluate the utility of T-Cube in a more general
setting, we created a sparse binary synthetic dataset (SYN). It has a large number of
attributes, 32 in total: Date, Count, Zip Code, and 29 binary demographic attributes. Each
of the 29 binary attributes is 95% sparse, i.e. it takes the value ‘0’ in 95% of the
records. Each record has its Zip Code randomly assigned from a pool of 10,000 values.
Attribute Date spans one year. Attribute Count takes values randomly selected from the
range of 5 to 10. Twelve million records were generated in this way, resulting in 4.5
million DPSes in the data.
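
A generator along the following lines would produce records with the stated properties.
This is a sketch and not the script used to create SYN: the function name
`make_syn_records`, the start date, the uniform sampling of dates and zip codes, the
inclusive Count range, and the fixed seed are all assumptions.

```python
import random
from datetime import date, timedelta


def make_syn_records(n_records, n_binary=29, sparsity=0.95, n_zips=10_000, seed=0):
    """Yield tuples (date, zip_code, b_1, ..., b_29, count)."""
    rng = random.Random(seed)
    start = date(2006, 1, 1)  # hypothetical start of the one-year span
    for _ in range(n_records):
        day = start + timedelta(days=rng.randrange(365))
        zip_code = rng.randrange(n_zips)  # index into a pool of 10,000 zip codes
        # Each binary attribute is 0 with probability `sparsity`.
        bits = tuple(0 if rng.random() < sparsity else 1 for _ in range(n_binary))
        count = rng.randint(5, 10)  # assumed inclusive on both ends
        yield (day, zip_code, *bits, count)


sample = list(make_syn_records(1000))
print(len(sample[0]))  # 32 attributes per record
```

With 29 sparse binary attributes plus a high-arity zip code, most generated records carry
a distinct combination of values, which is consistent with the large number of DPSes
reported for SYN.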

4.2  Results 

This section reviews the experiments performed to empirically evaluate the performance of
T-Cube. First, we discuss the T-Cube build time and memory utilization. We then present
ways of controlling the exponential memory requirements of the underlying AD-Tree
structure: attribute ordering, controlling tree depth, and exploiting efficiencies that
stem from a special treatment of the most common values of demographic attributes. The
effects on performance were measured with regard to simple queries on a system with a
dual-processor AMD Opteron 242 CPU (1,600 MHz) and 16GB of memory, running the CentOS 4
x86_64 operating system.

4.2.1  Build  Time  and  Memory  Utilization 

Figure 2 shows the D-Cache and AD-Tree build times for all three datasets. The build time
for the D-Cache ranges between approximately 3 and 15 minutes, whereas it takes only a few
seconds to build the corresponding fully developed AD-Trees. It was not possible to build
an AD-Tree for the SYN dataset because it required more memory than the test-bed system
could provide. These results indicate that most of the time needed to build a T-Cube is
spent on building the D-Cache, and most of that time is spent reading the data from disk.
Once all the DPS time series derived from the data are stored in the D-Cache, construction
of the AD-Tree is fast. Since the D-Cache structure is built only once for a given
dataset, this price is paid only once, upfront.

Figure 2. Build time of D-Cache and AD-Tree.

Figure 3. Memory utilization by D-Cache and AD-Tree.


Table 2. Summary of building time and memory utilization
(with use of fully developed AD-Trees).

           Building Time (secs)      Memory (MB)
Dataset    D-Cache    AD-Tree        D-Cache    AD-Tree
ED             124         16             36        676
OTC            944         26             77       1116
SYN            913          -            808          -

Figure 3 depicts the memory utilization of the D-Cache and the AD-Tree for all three
datasets. Here, the AD-Trees are much more resource hungry than the D-Caches. For
instance, to represent the ED dataset we need only 36MB of memory to store the D-Cache,
but the fully developed AD-Tree requires 676MB (approximately 20 times more). This is
because the memory requirement of the AD-Tree is exponential in the number of DPSes
present in the data. Again, for the SYN dataset the AD-Tree required more than 16GB of
memory, and hence the corresponding result is not available. Note that for both D-Caches
and AD-Trees the demand for memory is proportional to the length of the represented time
series.

Table 2 summarizes the memory consumption and build time results obtained using the basic
implementation of AD-Trees. The following sections review ways of reducing the demand for
memory at the cost of longer query response times.

4.2.2  Ordering  of  Demographic  Attributes 

The AD-Tree structure is inherently unbalanced: its left part is deeper than its right
part. We have found empirically that increasing this imbalance even further, by arranging
the demographic attributes in decreasing order of their arity, leads to AD-Trees with
fewer nodes in total. This heuristic places the attribute of the highest arity at the root
of the tree, while the attributes of the lowest arities end up deep in the tree.
Intuitively, it helps conserve memory by preventing the attributes with high arities from
being represented multiple times as children in the deeper parts of the structure.
According to this heuristic, we would arrange the demographic attributes of the ED dataset
in the following order: patient home Zip Code, Syndrome, Age Group, and Gender.
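
The ordering step itself is a one-line sort. The sketch below applies it to the ED arities
reported in Section 4.1.1; the function name `order_by_decreasing_arity` is hypothetical,
and breaking ties by input order is an assumption, since the report does not say how
attributes of equal arity are ordered.

```python
def order_by_decreasing_arity(arities):
    """Return attribute names sorted from highest to lowest arity.

    sorted() is stable, so attributes with equal arity keep their input order.
    """
    return sorted(arities, key=lambda name: arities[name], reverse=True)


# Arities of the ED demographic attributes, listed in record order.
ed_arities = {"Syndrome": 8, "Zip Code": 21179, "Age Group": 3, "Gender": 3}
print(order_by_decreasing_arity(ed_arities))
# ['Zip Code', 'Syndrome', 'Age Group', 'Gender']
```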

Table 3 compares the memory requirements resulting from the use of the heuristic described
above vs. its direct opposite. It is clear that arranging attributes in decreasing order
of their arities saves memory compared to arranging them in increasing order of arities,
and there is no discernible effect on the query response time. We could save nearly 200MB
of memory when representing the ED data while using about 140 thousand fewer AD-Tree
nodes. However, the savings were not sufficient to allow for representing the SYN dataset
within the 16GB of memory.


Table 3. Effects of ordering demographic attributes.

           Memory (MB)                #nodes
Dataset    Increasing   Decreasing    Increasing   Decreasing
ED                875          676          484k         346k
OTC              1049         1039          565k         552k
SYN                 -            -             -            -

4.2.3  Controlling  Depth  of  the  Tree 

The root node of the AD-Tree represents the aggregate time series that matches all
combinations of the demographic properties, i.e. the leaf-list of the root node is as
large as the number of rows in the D-Cache. A complete tree keeps growing until the leaf
nodes represent exactly one DPS each, at which point such nodes point directly to
individual, distinct D-Cache rows. The data node leaves in the fully grown AD-Tree
therefore have a leaf-list size equal to 1.

Figure 4. Memory utilization at different r-values.

To reduce the memory requirement, we can set a lower bound on the leaf-list size of all
data nodes in the tree; call it the r-value. A data node is expanded further only if its
leaf-list size is greater than the r-value. Larger r-values correspond to smaller trees
and hence lower memory usage. Note that with r-value > 1, the AD-Tree leaves point to
multiple rows in the D-Cache. Hence, queries that invoke more specific time series have to
sequentially scan those D-Cache rows, which takes linear time and reduces the expected
speed of query response. Figures 4 and 5 depict memory requirements and query response
times for different r-values.
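
The trade-off can be sketched on the t-shirt rows of Table 1. In the simplified tree
below, a node whose leaf-list is no longer than the r-value is left unexpanded and keeps
its leaf-list; a query that needs more detail than such a node offers falls back to a
linear scan of those D-Cache rows. The dictionary-based node layout and the function names
are assumptions made for this illustration only.

```python
# D-Cache rows from Table 1: DPS (color, size) -> daily counts.
D_CACHE = {
    ("Red", "Small"): [10, 15, 20],
    ("Red", "Large"): [5, 1, 7],
    ("Blue", "Medium"): [7, 5, 9],
    ("Green", "Small"): [5, 4, 10],
}
N_ATTRS = 2  # attribute 0 = color, attribute 1 = size


def build(leaf_list, r_value, first_attr=0):
    n_days = len(D_CACHE[leaf_list[0]])
    node = {
        "series": [sum(D_CACHE[dps][t] for dps in leaf_list) for t in range(n_days)],
        "vary": {},         # attribute index -> {value: child node}
        "leaf_list": None,  # set only on unexpanded nodes
    }
    if len(leaf_list) <= r_value:
        node["leaf_list"] = leaf_list  # stop here and remember the D-Cache rows
        return node
    for attr in range(first_attr, N_ATTRS):
        groups = {}
        for dps in leaf_list:
            groups.setdefault(dps[attr], []).append(dps)
        node["vary"][attr] = {
            v: build(rows, r_value, attr + 1) for v, rows in groups.items()
        }
    return node


def simple_query(root, constraints):
    """constraints maps attribute index -> single value."""
    n_days = len(root["series"])
    node = root
    for attr in sorted(constraints):
        if node["leaf_list"] is not None:
            # Unexpanded node: scan its D-Cache rows (linear in the leaf-list size).
            total = [0] * n_days
            for dps in node["leaf_list"]:
                if all(dps[a] == v for a, v in constraints.items()):
                    total = [x + y for x, y in zip(total, D_CACHE[dps])]
            return total
        node = node["vary"][attr].get(constraints[attr])
        if node is None:
            return [0] * n_days
    return node["series"]


def count_nodes(node):
    children = (c for group in node["vary"].values() for c in group.values())
    return 1 + sum(count_nodes(c) for c in children)


for r in (1, 2, 4):
    tree = build(list(D_CACHE), r)
    print(r, count_nodes(tree), simple_query(tree, {0: "Red", 1: "Large"}))
```

The answers are identical for every r-value; only the number of stored nodes and the
amount of scanning at query time change.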

Figure 5. Average (simple) query response times for different r-values.

For ED data, the memory requirement falls from 676MB to 38MB (almost 20 times) when the
r-value is changed from 1 to 100 (Figure 4). A similar effect can be seen for OTC data.
Note that the figures do not show results for the SYN dataset, as it still needed more
than 16GB of memory even when using r-value = 10,000.


Table 4. Sample dataset for Figure 6.

Date         Gender   Place   Count
2006/01/01   M        100     4
2006/01/01   M        300     3
2006/01/01   F        300     1
2006/01/01   M        200     3
2006/01/01   F        400     2
2006/01/02   M        200     1
2006/01/02   F        400     4
2006/01/02   M        300     2
2006/01/02   F        300     5
2006/01/02   M        200     6
2006/01/03   M        200     2
2006/01/03   F        300     1
2006/01/03   M        100     4
2006/01/03   F        300     2
2006/01/03   F        400     3

Figure 5 shows the T-Cube response time to simple queries as a function of the tree size.
Each presented result is the average response time over 1,000 randomly generated simple
queries. Note that changing the r-value from 1 to 100 drastically reduces memory
requirements with only a marginal difference in response time.

4.2.4  Exploiting  Most  Common  Values 

It is possible to further reduce memory requirements by exploiting redundancies in the
AD-Tree structure. Moore and Lee [2] have shown that it is possible to remove one vary
node together with the corresponding sub-tree from under each dimension node, and still be
able to recover all the counts which were represented in the complete tree. Typically, the
greatest savings in terms of the number of nodes in the tree are attained if the removed
sub-trees correspond to the most common value of the attribute under consideration.


(300)  =  (*,  *)  -  (100)  -  (200)  -  (400) 

(M,  300)  =  (300)  -  (F,  300) 

(F,  400)  =  (F,  *)  -  (F,  300) 


Figure 6. Illustration of the Most-Common-Value method.


Figure 6 illustrates this idea using an AD-Tree built for the sample dataset shown in
Table 4. The nodes (M, *), (F, 400), (300) and everything below them can be removed
from the tree. Any query that requires the counts from the removed portions of the tree
can be served by re-computing the missing numbers from the data stored in the remaining
nodes. These re-calculations may require some navigation through the tree and a few basic
arithmetic operations based on the fact that the sum of the counts corresponding to any
sub-tree must equal the counts attributed to that sub-tree by its parent node. So, to
retrieve the counts corresponding to a missing sub-tree, one only needs to subtract the
sum of the counts attributed to the existing nodes from the sum stored at the root. For
instance, the count of records for a given day in the dataset shown in Table 4 which
correspond to Place = 300 can be restored as the total count of records on that day
(which is stored in the root node of the tree) less the sum of the same-day counts
represented in the nodes corresponding to Place = 100, Place = 200, and Place = 400.
Having figured out that number, one can also re-compute other removed entries, such as
the count of male patients reporting from the area 300, (M, 300), as the difference
between the count for Place = 300 and the count of female patients from that area,
(F, 300), which happens to be represented in the tree.
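
The subtraction steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `stored`, the helper `subtract`, and the two-day count vectors are hypothetical stand-ins for the time series kept in the pruned tree, not the T-Cube data structure itself.

```python
# Hypothetical per-day counts (two days) for nodes that remain in the pruned tree.
# Keys are (Gender, Place); "*" stands for "any value". All numbers are illustrative.
stored = {
    ("*", "*"): [13, 12],  # root: all records per day
    ("*", 100): [0, 4],
    ("*", 200): [6, 2],
    ("*", 400): [0, 3],
    ("F", 300): [5, 3],
}


def subtract(total, *parts):
    """Return total minus the element-wise sum of parts (one entry per day)."""
    return [t - sum(p[i] for p in parts) for i, t in enumerate(total)]


# (300) = (*, *) - (100) - (200) - (400)
place_300 = subtract(
    stored[("*", "*")], stored[("*", 100)], stored[("*", 200)], stored[("*", 400)]
)

# (M, 300) = (300) - (F, 300)
male_300 = subtract(place_300, stored[("F", 300)])

print(place_300, male_300)
```

Each recovered entry costs one pass over the sibling time series, which is why the extra work grows with the number of siblings that have to be subtracted.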

So indeed it is possible to remove one child from each vary node and not lose any
information. This leads to substantial savings in memory requirements at a modest increase
in the average query response time. The extra time is required to re-compute the missing
counts whenever necessary. The most-common-value (MCV) trick results in immense
memory savings for datasets whose demographic attributes have a skewed distribution of
their distinct values (such as the SYN dataset). It may not be as beneficial when the
values of demographic attributes are distributed more uniformly. For instance, if the MCV
trick is applied to a vary node with 10,000 children, each having an equally long leaf
list, then we will have to add 10,000 time series at run time to respond to the MCV child
query, which can be computationally expensive. In order to control such effects, we
introduce a parameter called the MCV fraction (y): the ratio of the leaf-list size of the
MCV node to the leaf-list size of its parent vary node. The T-Cube algorithm will apply
the MCV trick only if the observed ratio is higher than the MCV fraction threshold. This
parameter helps to balance the memory savings against the average query response time.
For y > 1.0, the MCV trick will never be used, and for y = 0.0, it will always be used.
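
The thresholding rule can be sketched as follows. The function name `should_apply_mcv` is hypothetical, and the sketch assumes that the leaf-list size of the parent vary node equals the sum of its children's leaf-list sizes; the text does not spell out that detail.

```python
def should_apply_mcv(child_leaf_sizes, mcv_fraction):
    """Decide whether to drop the most-common-value child of a vary node.

    child_leaf_sizes: leaf-list sizes of the children of one vary node.
    mcv_fraction: the threshold y from the text.
    Assumption: the parent's leaf-list size is the sum of its children's sizes.
    """
    parent_size = sum(child_leaf_sizes)
    if parent_size == 0:
        return False  # nothing to save on an empty node
    observed_ratio = max(child_leaf_sizes) / parent_size
    return observed_ratio > mcv_fraction


# Skewed attribute: one value covers 95% of the records, so the trick is applied.
print(should_apply_mcv([95, 5], 0.8))
# Uniform attribute with 10,000 equally sized children: the trick is skipped.
print(should_apply_mcv([1] * 10_000, 0.4))
```

With a strict comparison the observed ratio can never exceed 1.0, so any threshold of 1.0 or more disables the trick, while a threshold of 0.0 enables it for every non-empty node.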

Figure 7 summarizes the observed memory savings for different values of y. With the
MCV trick, we can finally fit the T-Cube for the SYN dataset into the main memory of the
test machine (recall that its 29 binary attributes are 95% sparse and have 0 as the
MCV). It requires only about 100 MB of memory, compared to the more than 16 GB it
needed before. The ED and OTC datasets also show savings in memory, but not as significant
as those obtained with the tree-depth method. This is because the values of demographic
attributes are almost uniformly distributed in those datasets.

Figure 8 shows the effect of varying y on the average simple query response time
(depicted on a logarithmic scale in the graph). The response times against the SYN dataset
are the slowest (~50 milliseconds), because most of the randomly generated queries involve
MCV values, which require extra time to be re-computed. For the ED and OTC datasets,
y = 0.4 seems to be a reasonable value to pick when both memory and response time are
considered.

Figure 7. Memory savings attainable with the MCV method.

Figure 8. Average (simple) query response time when using the MCV method.

4.2.5  Responding  to  Complex  Queries 

T-Cubes can be built in minutes even for large datasets. Their memory requirements can
be controlled and traded for response time in order to trim them to manageable levels
(a few hundred megabytes to store the AD-Trees for all three datasets considered).
Also, the average response time to a simple query can be maintained within a
fraction of a millisecond.

We define complex queries as those which are not strictly conjunctive, but allow each
demographic attribute to take multiple values simultaneously. A query which requests
counts for 1,000 Zip Codes is more complex than one which requests them for only 10.
Let us define the complexity of a query, P, as the fraction of the possible unique values
of an attribute which can be present in the complex query. Values of P lie in the range
[0,1], and higher values indicate more complex queries.
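
The definition above can be made concrete with a small sketch. It assumes, for illustration only, that a complex query is answered by summing the time series of every simple (conjunctive) query in the Cartesian product of the allowed values, which is valid for additive counts; how T-Cube evaluates such queries internally is not shown here. The names `query_complexity`, `complex_query`, and the toy lookup table are hypothetical.

```python
from itertools import product


def query_complexity(requested_values, arity):
    """Fraction of an attribute's distinct values that the query allows (P)."""
    return len(set(requested_values)) / arity


def complex_query(simple_query, allowed_values, n_days):
    """Sum simple-query time series over all combinations of allowed values.

    simple_query: callable mapping a tuple of attribute values to a count series.
    allowed_values: one list of permitted values per attribute.
    """
    total = [0] * n_days
    for combo in product(*allowed_values):
        series = simple_query(combo)
        total = [a + b for a, b in zip(total, series)]
    return total


# Toy lookup standing in for simple queries against a cube (illustrative numbers).
toy_counts = {("M", 200): [6, 2], ("F", 300): [5, 3], ("M", 100): [0, 4]}


def toy_simple_query(combo):
    return toy_counts.get(combo, [0, 0])


result = complex_query(toy_simple_query, [["M", "F"], [200, 300]], n_days=2)
print(result, query_complexity([200, 300], arity=4))
```

In this naive form the number of simple lookups grows with the product of the allowed-value counts, which is why response time is reported as a function of P.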

Figure 9 shows the performance of T-Cube at different values of P (the results are
averaged over 1,000 randomly picked queries). The parameters for the tested datasets
were as follows: r-value = 1,000 and y = 0.8. The observed average response time was
within 100 milliseconds for the SYN data and on the order of 1 millisecond for the ED and
OTC datasets. The results indicate that T-Cubes need only milliseconds to respond to even
highly complex queries.

Figure 9. Average response time to complex queries.


4.2.6  Comparison  with  Commercial  Tools 

We also compared the performance of T-Cube against other data cube tools available
commercially. Due to privacy concerns we do not list the names of these tools here, but
most of them are commonly used in practical OLAP applications involving
time series data. The data used for these tests had three demographic attributes with
arities of 1,000, 10, and 5 respectively, contained 12 million records of daily
transactions, and covered a period of one year. The experiments were performed on a system
with a 2.4 GHz CPU and 2 GB of memory, running the Windows XP operating system.

Table 5 shows the results of the comparison. Each of the commercial tools required a
different amount of memory to represent the test data; however, for all tools the response
time improved as the amount of memory used increased. Even so, the
commercial tools require seconds to respond to a complex query. T-Cube, on the other
hand, is able to respond in milliseconds, i.e. on the order of 1,000 times faster than the
commercial tools. The two versions of T-Cube (rows 4 and 5) differ in the value of y, the
MCV fraction. The first one uses y = 1,000 and in the second y was set to 10 in order to
illustrate the trade-off between memory consumption and response time attainable with the
T-Cubes.


Table  5.  Performance  comparison:  T-Cube  vs.  commercial  tools. 


Query Engine    Type         Memory    Response Time
Tool 1          RDBMS        330 MB    6.8 sec
Tool 2          In memory    231 MB    7.6 sec
Tool 3          In memory    1+ GB     3.5 sec
T-Cube          In memory    236 MB    22 milliseconds
T-Cube          In memory    845 MB    5 milliseconds

4.2.7  Early  Observations  from  Fielded  Applications

The authors have had a chance to see T-Cubes used in practice as an enabling technology in
applications requiring massive screening of multidimensional temporal data. These
applications  include  systems  to  support  monitoring  of  food  and  agriculture  safety 
developed  at  the  US  Department  of  Agriculture  and  the  Food  and  Drug  Administration, 
as  well  as  a  system  to  monitor  maintenance  and  supply  data  operated  by  the  US  Air 
Force. 

One of these projects involved a database consisting of datasets, each with about 25
demographic attributes of arities varying from 2 to 80 and about 12 thousand records of
transactions, covering 6 years at a daily resolution. The application called for a massive
screening through all combinations of attribute-value pairs of sizes 1 and 2, the total
number of such combinations exceeding 4.3 million. The analytics involved were based on
an expectation-based temporal scan used to detect unusual short-term increases in counts
of specific aggregate time series. The total number of individual temporal scan tests for
one such dataset exceeded 9.3 billion. Each such test involved a Chi-square test of
independence performed on a 2-by-2 contingency table formed by the counts
corresponding to the time series of interest (one of the 4.3 million series) and the
baseline counts, within the current temporal window of interest (one of 2,000+) and
outside of it. The complete set of computations, including the time necessary to retrieve
and aggregate all the involved time series, compute and store the test results, load source
data and build the T-Cube structure, etc., took about 8 hours when executed on a dual-CPU
AMD Opteron 242 1,600 MHz machine in 64-bit mode, with 1 MB of level 2 cache per CPU
and 4 GB of main memory, running the CentOS 4 Linux operating system. If
the users had used one of the commercial database tools, the time needed to retrieve the
time series data corresponding to one of the involved queries would approach 180
milliseconds. Therefore, without the T-Cube, it would take about 9 days just to pull all
the required time series data from the database, not including any processing or execution
of statistical tests. That kind of analysis would be considered infeasible without the
efficiencies provided by the T-Cube representation.
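
The figures quoted above can be cross-checked with a short calculation, shown here together with a sketch of a single 2-by-2 test. Two assumptions are made that the text does not state: the statistic is the plain Pearson Chi-square without continuity correction, and the cells are arranged as (series inside window, series outside window, baseline inside window, baseline outside window). The function and variable names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square statistic for the 2-by-2 table [[a, b], [c, d]].

    Assumed layout: a, b = counts of the series of interest inside and outside
    the temporal window; c, d = baseline counts inside and outside it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * c) ** 2 / denom


# Back-of-the-envelope checks using the approximate figures from the text.
n_series = 4_300_000           # attribute-value combinations of sizes 1 and 2
n_tests = 9_300_000_000        # individual temporal scan tests
db_seconds_per_query = 0.180   # quoted retrieval time of a commercial tool

windows_per_series = n_tests / n_series
db_days = n_series * db_seconds_per_query / 86_400

print(round(windows_per_series), round(db_days, 2))
print(round(chi_square_2x2(10, 20, 30, 40), 4))
```

The ratio of tests to series comes out at a little over 2,000 windows per series, and the retrieval estimate at roughly nine days, both consistent with the numbers reported above.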

5  Conclusions 

T-Cube is an efficient tool for representing additive time-series data labeled with a set
of symbolic attributes. It is especially useful for retrieving responses to ad-hoc queries
against large datasets of that kind. We showed that a T-Cube typically performs that task
around 1,000 times faster than currently available commercial tools. This efficiency is
very useful both in on-line investigation and in statistical data mining application
scenarios. Rapid aggregation of time series across large sets of data, made possible by
T-Cubes, becomes an enabling capability which makes manual lookups as well as many
complex analyses feasible. T-Cube can be used as a general tool for any application
requiring access to time series data from a database. From the application's perspective
it is transparent: it acts just like the database itself, only with much faster responses.

The size of memory used by T-Cube can be finely controlled by managing the tree depth
or using the MCV trick. There is a trade-off between memory usage and the average
query response time, and allowing more memory to be used by T-Cube leads to shorter
response times. The experiments indicate that by selecting the right set of parameters one
can achieve considerable memory savings at only linear increases in response time. The
realistically sized datasets we have tested so far were manageable in that their T-Cube
representation would fit in less than 1 GB of memory. This shows that T-Cubes can be
used in practice to speed up access to large sets of time series data even on regular
personal computers, alleviating the need for large and expensive servers.

T-Cubes are simple to set up and easy to use. Typically, it takes only minutes to build
one from data. Database users do not need to define any stored procedures or
materialized views in order to make that happen. Once a T-Cube is built, it is ready to
respond to any simple or complex query. The ability to quickly answer any ad-hoc
query makes T-Cubes potentially very useful in a range of applications, including a few
specifically known to the authors. They have the potential to change the way users (people
as well as analytic software systems) deal with time series datasets.